Introduction

This Dataset from the U.S. Small Business Administration (SBA) and a paper is published in the website below:

https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o

The dataset (UJSE_1434342_Supplementary_Files.zip) can be download from this website

SBA & loan guarantee program

  • Founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market.

  • The Loan guarantee program is to assist small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses.

  • The loan money may use for creating job opportunities and reducing unemployment.

  • SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.

  • Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan.

!pip install --user geopandas==0.3.0 pyshp==1.2.10 shapely==1.6.3 plotly cufflinks==0.12.0

In [1]:
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))

from IPython.display import IFrame
In [2]:
import sys
sys.path.insert(0,'../')
from utils.paths import *
In [3]:
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
import cufflinks as cf
init_notebook_mode()
cf.go_offline()
from __future__ import division
In [4]:
# User define function

def table(no):
    # there are 13 additional table
    return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')

def default_cat_table(data, cat):
    default_cat = data.groupby([cat, 'default']).count().max(1).unstack()
    default_cat['ALL'] = data[cat].value_counts()
    default_cat['default_rate'] = (default_cat[1] / default_cat['ALL'])
    default_cat = default_cat.rename(columns = {1: 'Default', 0: 'Non-default'})
    return default_cat
In [5]:
nat = pd.read_csv(path_SBA + 'SBAnational_new.csv', sep = ';', low_memory=False,
                 dtype = {'Zip'   : str,
                          'Zip3d' : str,
                          'Zip5d' : str,
                          'fips'  : str})

NB: This dataset has been cleaned and added a few features such as:

  • suffix: the last word of Name

  • Loan_age: Age of the company since the first loan

  • default_times: no. of default in the pass

In [6]:
nat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 897137 entries, 0 to 897136
Data columns (total 46 columns):
LoanNr_ChkDgt          897137 non-null int64
Name                   897137 non-null object
City                   897107 non-null object
State                  897137 non-null object
Zip                    897137 non-null object
Bank                   895631 non-null object
BankState              895624 non-null object
NAICS                  897137 non-null int64
ApprovalDate           897137 non-null object
ApprovalFY             897137 non-null int64
Term                   897137 non-null int64
NoEmp                  897137 non-null int64
NewExist               897003 non-null float64
CreateJob              897137 non-null int64
RetainedJob            897137 non-null int64
FranchiseCode          897137 non-null int64
UrbanRural             897137 non-null int64
RevLineCr              897137 non-null int64
LowDoc                 891138 non-null float64
ChgOffDate             162419 non-null object
DisbursementDate       894965 non-null object
DisbursementGross      897137 non-null float64
BalanceGross           897137 non-null float64
MIS_Status             897137 non-null object
ChgOffPrinGr           897137 non-null float64
GrAppv                 897137 non-null float64
SBA_Appv               897137 non-null float64
default                897137 non-null int64
Zip5d                  897137 non-null object
Zip3d                  897137 non-null object
SBA_ratio              897137 non-null float64
Zip_length             897137 non-null int64
RealEstate             897137 non-null int64
NAICS_default_rate     695491 non-null float64
NAICS_group            897137 non-null int64
Name2                  897137 non-null object
loan_start             897137 non-null int64
loan_record_dict       897137 non-null object
suffix                 897137 non-null object
Loan_age               897137 non-null int64
Previous_loan          897137 non-null int64
default_record_dict    162574 non-null object
default_times          897137 non-null int64
zip                    874130 non-null float64
fips                   874130 non-null object
BusinessType           705872 non-null object
dtypes: float64(10), int64(18), object(18)
memory usage: 314.9+ MB

Variable Name

  • LoanNr_ChkDgt: Identifier – Primary Key
  • Name: Borrower Name
  • City: Borrower City
  • State: Borrower State
  • Zip: Borrower Zip Code
  • Bank: Bank Name
  • BankState: Bank State
  • NAICS: North American Industry Classification System code
  • ApprovalDate: Date SBA Commitment Issued
  • ApprovalFY: Fiscal Year of Commitment
  • Term: Loan term in months
  • NoEmp: Number of Business Employees
  • NewExist: 1 = Existing Business, 2 = New Business
  • CreateJob: Number of jobs created
  • RetainedJob: Number of jobs retained
  • FranchiseCode: Franchise Code 00000 or 00001 = No Franchise
  • UrbanRural: 1= Urban, 2= Rural, 0 = Undefined
  • RevLineCr: Revolving Line of Credit : Y = Yes
  • LowDoc: LowDoc Loan Program: Y = Yes, N = No
  • ChgOffDate: The date when a loan is declared to be in default
  • DisbursementDate: Disbursement Date
  • DisbursementGross: Amount Disbursed
  • BalanceGross: Gross amount outstanding
  • MIS_Status: Loan Status
  • ChgOffPrinGr: Charged-off Amount
  • GrAppv: Gross Amount of Loan Approved by Bank
  • SBA_Appv: SBA’s Guaranteed Amount of Approved Loan
In [7]:
# Top banks
nat.Bank.value_counts().head()
Out[7]:
BANK OF AMERICA NATL ASSOC        86766
WELLS FARGO BANK NATL ASSOC       63459
JPMORGAN CHASE BANK NATL ASSOC    48128
U.S. BANK NATIONAL ASSOCIATION    35112
CITIZENS BANK NATL ASSOC          33770
Name: Bank, dtype: int64
In [8]:
# Top borrowers
nat.groupby('Name').GrAppv.sum().sort_values(ascending = False).head()
Out[8]:
Name
DAYS INN               240302066.0
HOLIDAY INN EXPRESS    230528070.0
COMFORT INN            213150659.0
SUPER 8 MOTEL          200078752.0
SUBWAY                 170836574.0
Name: GrAppv, dtype: float64
In [9]:
# Top default companies
nat[nat.default == 1].groupby('Name').GrAppv.sum().sort_values(ascending = False).head()
Out[9]:
Name
COLD STONE CREAMERY    27829902.0
SCHLOTZSKY'S DELI      15127350.0
QUIZNO'S SUBS          12763600.0
DAYS INN               12558000.0
PLANET BEACH            9318314.0
Name: GrAppv, dtype: float64
In [10]:
# Biggest loan (single)
nat[['Name', 'GrAppv']].sort_values('GrAppv', ascending = False).head()
Out[10]:
Name GrAppv
451084 Simon PLC Attorneys & Counselo 5472000.0
492585 CARTER BROTHERS, LLC 5000000.0
453977 SEATTLE SAFETY, LLC 5000000.0
593815 CP Franchising LLC 5000000.0
528775 Preston Consulting I, LLC 5000000.0
In [11]:
# Biggest loan (single)
nat[nat.default == 1][['Name', 'GrAppv']].sort_values('GrAppv', ascending = False).head()
Out[11]:
Name GrAppv
469950 DWG & ASSOCIATES, INC. 3500000.0
376402 Richard M. Spyhalski dba Signa 3411950.0
256511 FLORIDA COMPUTERIZED MACHINING 2900000.0
179544 DEQUINDRE PROPERTIES INC 2700000.0
391238 DAYS INN-CINCINNATI 2485000.0

Cases in year

In [12]:
default_cat_table(nat, 'ApprovalFY').ALL.iplot(kind = 'bar', title = 'SBA cases',
                                               yTitle = 'no. of cases', xTitle = 'Year')
In [13]:
default_cat_table(nat, 'ApprovalFY').loc[1990:2014].default_rate.iplot(kind = 'bar', title = 'SBA default rate',
                                               yTitle = 'Default rate', xTitle = 'Year')

NAICS group

In [14]:
nace_group = table(3)
nace_group['sector_group'] = nace_group.Sector.str[:2]
nace_group.head()
Out[14]:
Sector Description sector_group
0 11 Agriculture, forestry, fishing and hunting 11
1 21 Mining, quarrying, and oil and gas extraction 21
2 22 Utilities 22
3 23 Construction 23
4 31–33 Manufacturing 31
In [15]:
default_cat_table2 = default_cat_table(nat, 'NAICS_group').reset_index()

default_cat_table2['NAICS_group'] = default_cat_table2['NAICS_group'].astype(str)
default_cat_table2 = default_cat_table2.merge(nace_group, how = 'left', left_on = 'NAICS_group', right_on = 'sector_group')
use_cols = ['Description', 'NAICS_group', 'default_rate', 'Default', 'ALL']
default_cat_table2[use_cols]
Out[15]:
Description NAICS_group default_rate Default ALL
0 NaN 0 0.083230 16783 201646
1 Agriculture, forestry, fishing and hunting 11 0.090272 812 8995
2 Mining, quarrying, and oil and gas extraction 21 0.084819 157 1851
3 Utilities 22 0.141994 94 662
4 Construction 23 0.232558 15463 66491
5 Manufacturing 31 0.153722 10438 67902
6 Wholesale trade 42 0.194777 9480 48671
7 Retail trade 44 0.227346 28867 126974
8 Transportation and warehousing 48 0.265051 5939 22407
9 Information 51 0.248284 2821 11362
10 Finance and insurance 52 0.284266 2692 9470
11 Real estate and rental and leasing 53 0.287312 3904 13588
12 Professional, scientific, and technical services 54 0.190766 12957 67921
13 Management of companies and enterprises 55 0.101562 26 256
14 Administrative and support and waste managemen... 56 0.235513 7661 32529
15 Educational services 61 0.242462 1552 6401
16 Health care and social assistance 62 0.103793 5736 55264
17 Arts, entertainment, and recreation 71 0.206144 3013 14616
18 Accommodation and food services 72 0.220438 14882 67511
19 Other services (except public administration) 81 0.196552 14229 72393
20 Public administration 92 0.154185 35 227
In [16]:
default_cat_table2.set_index('Description').default_rate.sort_values().iplot(kind = 'bar', title = 'Default on sectors', 
                                                                            yTitle = 'Default rate')

Business type

In [17]:
default_cat_table(nat, 'BusinessType')
Out[17]:
default Non-default Default ALL default_rate
BusinessType
CORPORATION 386701 92635 479336 0.193257
INDIVIDUAL 157728 38687 196415 0.196966
PARTNERSHIP 26998 3123 30121 0.103682

Loan term

In [18]:
nat['Term_year'] = nat.Term.apply(lambda x: round(x/12))
In [19]:
default_cat_table(nat, 'Term_year').head()
Out[19]:
default Non-default Default ALL default_rate
Term_year
0.0 2489.0 7864.0 10353 0.759587
1.0 22221.0 14537.0 36758 0.395479
2.0 9413.0 16973.0 26386 0.643258
3.0 21374.0 22609.0 43983 0.514040
4.0 17435.0 24939.0 42374 0.588545

It appears that SBA may put term to 0 if the cases are default. Therefore it is not recommended to use this variable. However, this variable were used to create the other variable 'RealEstate' because banks only approve long loan if the company is an owner of properties.

Suffix

The last word of the company name may indicate the legal type / type of business

In [20]:
default_cat_table(nat, 'suffix').sort_values('default_rate', ascending = False).head(10)
Out[20]:
default Non-default Default ALL default_rate
suffix
LLC 59218 18110 77328 0.234197
LL 2102 611 2713 0.225212
CORP 16047 4271 20318 0.210208
ENTERPRISES 2353 621 2974 0.208810
CONSTRUCTION 2395 618 3013 0.205111
SALON 1558 402 1960 0.205102
IN 6161 1376 7537 0.182566
SERVICE 11988 2625 14613 0.179635
NO SUFFIX 349956 75430 425386 0.177321
RESTAURANT 7604 1583 9187 0.172309

FranchiseCode

In [21]:
default_cat_table(nat, 'FranchiseCode')
Out[21]:
default Non-default Default ALL default_rate
FranchiseCode
0 695724 149681 845405 0.177052
1 43872 7860 51732 0.151937

UrbanRural

In [22]:
default_cat_table(nat, 'UrbanRural')
Out[22]:
default Non-default Default ALL default_rate
UrbanRural
0 299840 22962 322802 0.071133
1 354409 114866 469275 0.244773
2 85347 19713 105060 0.187636

RevLineCr

Revolving line of credit

Revolving credit is a line of credit where the customer pays a commitment fee to a financial institution to borrow money, and is then allowed to use the funds when needed. It usually is used for operating purposes and the amount drawn can fluctuate each month depending on the customer's current cash flow needs. Revolving lines of credit can be taken out by corporations or individuals.

https://www.investopedia.com/terms/r/revolvingcredit.asp#ixzz5Vgppy7tP

In [23]:
default_cat_table(nat, 'RevLineCr')
Out[23]:
default Non-default Default ALL default_rate
RevLineCr
0 589927 106551 696478 0.152985
1 149669 50990 200659 0.254113

LowDoc

In [24]:
default_cat_table(nat, 'LowDoc')
Out[24]:
default Non-default Default ALL default_rate
LowDoc
0.0 634910 146060 780970 0.187024
1.0 100264 9904 110168 0.089899

Program for small loan that require much less documents

Past records

Past record was based on if that company has previous appearence in that dataset (e.g. previous loan, when did the first loan made and default history.)

In [25]:
default_cat_table(nat, 'Loan_age')
Out[25]:
default Non-default Default ALL default_rate
Loan_age
0 709753.0 153200.0 862953 0.177530
1 10115.0 1996.0 12111 0.164809
2 6560.0 915.0 7475 0.122408
3 4053.0 503.0 4556 0.110404
4 2685.0 306.0 2991 0.102307
5 1885.0 181.0 2066 0.087609
6 1316.0 130.0 1446 0.089903
7 895.0 93.0 988 0.094130
8 680.0 58.0 738 0.078591
9 438.0 48.0 486 0.098765
10 358.0 28.0 386 0.072539
11 248.0 23.0 271 0.084871
12 184.0 19.0 203 0.093596
13 142.0 16.0 158 0.101266
14 99.0 10.0 109 0.091743
15 59.0 4.0 63 0.063492
16 42.0 4.0 46 0.086957
17 30.0 1.0 31 0.032258
18 14.0 3.0 17 0.176471
19 13.0 1.0 14 0.071429
20 9.0 1.0 10 0.100000
21 5.0 1.0 6 0.166667
22 3.0 NaN 3 NaN
23 4.0 NaN 4 NaN
25 1.0 NaN 1 NaN
26 2.0 NaN 2 NaN
27 3.0 NaN 3 NaN
In [26]:
default_cat_table(nat, 'default_times')
Out[26]:
default Non-default Default ALL default_rate
default_times
0 739028.0 155135.0 894163 0.173497
1 532.0 2215.0 2747 0.806334
2 33.0 169.0 202 0.836634
3 3.0 19.0 22 0.863636
4 NaN 2.0 2 1.000000
5 NaN 1.0 1 1.000000

Locations and defaults

Zip code

In [27]:
nat.Zip5d.value_counts().head()
Out[27]:
10001    931
90015    926
99999    850
93401    756
90010    733
Name: Zip5d, dtype: int64
In [28]:
nat.Zip3d.value_counts().head()
Out[28]:
900    12463
770    10297
840     8055
750     7945
945     7115
Name: Zip3d, dtype: int64

Fips code

ZIP code is not ideal to use in plotting data in map, so here we convert the ZIP code to FIPS code

In [29]:
default_fips = default_cat_table(nat, 'fips')
default_fips[default_fips.ALL > 10].sort_values(['ALL', 'default_rate'], ascending = False).head()
Out[29]:
default Non-default Default ALL default_rate
fips
06037 30731.0 9211.0 39942 0.230609
17031 9811.0 3744.0 13555 0.276208
06073 10851.0 2133.0 12984 0.164279
48201 9882.0 2636.0 12518 0.210577
04013 9742.0 2673.0 12415 0.215304
In [30]:
fip_ct = nat.fips.value_counts()
fip_ct[fip_ct >= 10].shape
Out[30]:
(2542,)
In [31]:
default_fips = nat.groupby(['fips', 'default']).count().max(1).unstack()
default_fips['ALL'] = nat.fips.value_counts()
default_fips = default_fips.reset_index()
default_fips = default_fips.rename(columns = {'index': 'fips', 
                                              0: 'Non-default',
                                              1: 'Default',
                                              })
default_fips['Default'] = default_fips['Default'].fillna(0)
default_fips['default_rate'] = (default_fips['Default'] / default_fips['ALL']).round(3)
default_fips_1 = default_fips[default_fips.ALL >= 10]
In [32]:
default_fips.head()
Out[32]:
default fips Non-default Default ALL default_rate
0 01001 48.0 14.0 62 0.226
1 01003 256.0 67.0 323 0.207
2 01005 51.0 5.0 56 0.089
3 01007 41.0 8.0 49 0.163
4 01009 229.0 38.0 267 0.142
In [33]:
import plotly.figure_factory as ff

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]
In [34]:
binpoint = [500, 1000, 3000, 5000, 10000, 20000]
cs = colorscale[0::2]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA cases', legend_title='no. of cases', round_legend_values=True
                          )
iplot(fig)
In [35]:
binpoint = [10, 100, 1000, 3000, 5000]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA default cases', legend_title='no. of default cases', round_legend_values=True
                          )
iplot(fig)
In [36]:
binpoint = [0.05, 0.1, 0.2, 0.35, 0.5]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips_1.fips, values = default_fips_1.default_rate, binning_endpoints = binpoint,
                          colorscale = cs, title='SBA Default rate', legend_title='Default rate'
                          )
iplot(fig)